[MUSIC]

Hi everyone. Welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.6, Introduction to CUDA, where we will be discussing kernel-based SPMD parallel programming. The objective of this lecture is to help you learn the basic concepts involved in a simple CUDA kernel function: how we declare a kernel, how we use built-in variables in a kernel, and also to review how we can map a thread index into a data index.

Here is a very simple example of how we can declare a CUDA kernel. This is the vector addition kernel that we will eventually be launching from the host code. There are several important parts in this kernel. The first one is the keyword: every CUDA kernel function needs to be preceded by __global__, and this signals to the compiler that the function is a kernel function.

The actual header of the function is exactly the same as that of any other traditional C function. We have the name, in this case vecAddKernel, which is just the name we happened to give to the kernel, followed by all the parameters it expects from the caller.

Within the kernel we are able to use several built-in variables. The variables shown in this code example are threadIdx.x, blockDim.x, and blockIdx.x. These are all built-in variables that are initialized by the hardware, so every thread can access them. Depending on its thread index and block index, every thread will see a different combination of threadIdx and blockIdx values. That is why we are able to assign a different value of i to different threads. We have already gone through this concept in a previous lecture.

Finally, we can use i as a data index to access the input data. In this case, we will use i to read one A element and one B element, and to write into one C element.

We will come back and discuss the if statement, that is, why we need to test whether i is less than n, where n is the number of elements in the A, B, and C vectors.
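As a concrete reference, here is a minimal sketch of the kernel just described. The structure (the __global__ keyword, the index calculation, and the boundary check) follows the lecture; the float pointer parameter types are an assumption, since the transcript only names the vectors A, B, and C and the length n.

```cuda
// Minimal sketch of the vector addition kernel discussed above.
// The float* parameter types are an assumption; the lecture only
// names the vectors A, B, C and the element count n.
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    // Map this thread's position in the grid to a data index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check: threads whose index falls outside the vectors
    // do nothing, so they cannot corrupt memory (discussed below).
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```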
This slide shows the kernel launch code in the host code. In the previous lecture we talked about allocating device memory and copying data to the device, and we said that we would present the kernel launch part in this lecture; this is where it happens. So we are going to change the second part of the vecAdd function in the host code into a kernel launch.

The kernel launch is very similar to any C function call. We still have the function name, in this case vecAddKernel, which is just the name we assigned to the kernel. But between the function name and the parameters given to the function, we also need to supply a set of configuration parameters to the kernel.

We actually have two configuration parameters. The first one is the number of blocks in the grid, and the second one is the number of threads in a block. What we are saying in this particular statement is that we want 256 threads in each thread block, and we want ceil(n / 256.0) blocks in the grid.

What is happening here is that whenever we generate threads, we want to have enough threads to cover all the vector elements. If we have 256 vector elements, it is a very simple case: with 256 threads per block, we only need one block to cover all 256 elements, so we would have 1 and 256 as our configuration parameters. If n is equal to 256, we can simply divide it by 256 and get the value 1. If n is 512, it is also simple, because 512 divided by 256 is 2, so 2 thread blocks perfectly cover 512 elements.

The case that is much more complicated is when n is not a multiple of 256. Say n is 1000. When we divide 1000 by 256 we get 3 point some fraction, which puts us in a situation where we do not have a perfect multiple of 256. In order to have enough threads to cover 1000 elements, we actually need to generate four thread blocks, so that the 1024 threads altogether give sufficient coverage of the 1000 elements. But then we will end up with 24 extra threads.
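Here is a minimal sketch of what that host-side launch looks like, assuming d_A, d_B, and d_C are device pointers that were already allocated and filled as described in the previous lecture (the wrapper function name launchVecAdd is invented for this example).

```cuda
#include <math.h>

// Sketch of the launch step inside the host-side vecAdd function.
// Assumes d_A, d_B, d_C are device pointers that were already
// allocated and copied to, as described in the previous lecture.
void launchVecAdd(float* d_A, float* d_B, float* d_C, int n) {
    // First configuration parameter: ceil(n / 256.0) blocks in the grid.
    // Second configuration parameter: 256 threads per block.
    vecAddKernel<<<(int)ceil(n / 256.0), 256>>>(d_A, d_B, d_C, n);
}
```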
So in order to always generate enough threads, we divide n by 256.0, which produces a floating-point number, 3 point a fraction, and then we take the ceiling function to round up to 4. The ceiling function ensures that we always generate the next larger integer number of blocks, which will have enough threads to cover all the elements in the vector.

The reason we use 256.0 is that we need to make sure we generate a floating-point number, and it is that floating-point number that gets rounded up. If we did not use 256.0 but instead used 256, the division would be done as an integer division, so 1000 divided by 256 would produce the value 3, and the ceiling function would not be able to take it up to 4. So this is one way to ensure that we always have enough threads.

Now, going back to the previous slide. When we launch the kernel to process 1000 elements, we will have 4 thread blocks. The last thread block will have threads that cover up to element 999, which is the 1000th element because elements start from 0. However, we will also have 24 threads whose i value is greater than or equal to 1000, and these threads do not correspond to any real elements of A, B, and C. We do not want those threads to do anything; in fact, if they try to access memory, they can corrupt the memory contents. So what we want to do is test whether the i value of each thread is actually less than n, which is 1000 here. Only threads that pass this test should do the element addition. This is the reason we need this check, and it is commonly referred to as a boundary check: it tests whether a thread is supposed to process an element within the boundary or outside the boundary. Whenever a thread corresponds to an element outside the boundary, that thread will not do anything in the computation.
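To make the integer-versus-floating-point division point concrete, here is a tiny worked example for n = 1000 (a standalone sketch, not code from the lecture).

```cuda
#include <math.h>
#include <stdio.h>

// Worked example of the division pitfall discussed above, for n = 1000.
int main(void) {
    int n = 1000;
    int blocksTruncated = n / 256;              // integer division: 3 blocks, too few
    int blocksRoundedUp = (int)ceil(n / 256.0); // 1000 / 256.0 = 3.90625, ceil gives 4
    printf("%d %d\n", blocksTruncated, blocksRoundedUp);  // prints "3 4"
    return 0;
}
```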
This slide shows two additional concepts for the kernel launch. The first one is that we can actually have two-dimensional and three-dimensional grids and blocks. The particular method we showed on the previous slide only works for launching one-dimensional grids, because each of those configuration parameters is just an integer value. Here we show a general method for launching a grid of two dimensions, three dimensions, or, in some cases, one dimension.

We now need to declare two variables. They are both of type dim3, and dim3 is available in the CUDA header file. dim3 simply means that a variable has three parts, each of which is an integer value. We can initialize such a variable with the syntax shown here, with the variable name followed by parentheses: the first value initializes the x value, the second value initializes the y value, and the third value initializes the z value. In this case, because we are using a one-dimensional grid, we only need to initialize the x values of both DimGrid and DimBlock. These variable names are your choice; you can call them anything you want, as long as you use them in the right positions of the kernel launch. Here we call DimGrid the configuration for the number of blocks in the grid, and DimBlock the number of threads in a thread block. So we can use this as a more general way of launching kernels.

The second concept is an alternative way to generate enough thread blocks to cover all the elements. Here you can use (n minus 1) divided by 256, where 256 can now be an integer, plus 1. The division gives an integer value, and adding 1 always gives the next larger integer number of blocks that will cover all the elements. Going back to the 1000-element example: 1000 minus 1 is 999, and if you divide 999 by 256 you get 3, and 3 plus 1 is 4. The most important case is, let's say, 256: 256 minus 1 is 255, so you get 0 from that division, but then we add 1, so every n from 1 to 256 gives us exactly one block under this formula.
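Here is a sketch of this more general, dim3-based launch, again assuming the same vecAddKernel and device pointers as before (the wrapper function name is invented for illustration).

```cuda
// Sketch of the dim3-based launch described above, assuming the same
// vecAddKernel and device pointers d_A, d_B, d_C as in the earlier example.
void launchVecAddGeneral(float* d_A, float* d_B, float* d_C, int n) {
    // (n - 1) / 256 + 1 uses integer division and still rounds up: it is
    // the smallest number of 256-thread blocks that covers n elements.
    dim3 DimGrid((n - 1) / 256 + 1, 1, 1);  // blocks in the grid (x, y, z)
    dim3 DimBlock(256, 1, 1);               // threads per block (x, y, z)
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
```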
If you work out all these possibilities, you will realize that this expression always gives you exactly the correct number of thread blocks, big enough to cover all the elements.

So now let's put everything together. Kernel execution works as follows. In the host code, you have a host function, and that host function can launch a kernel by calling the kernel function just like any other C function, but with additional configuration parameters. The configuration parameters need to be preceded by three less-than signs and followed by three greater-than signs. Once we launch a kernel, that is, call the kernel function, the hardware will generate a grid of threads to execute the kernel function; every thread will execute this kernel function. The kernel function is declared in the CUDA program by preceding the function declaration with the keyword __global__. This allows the host code and the device code to connect, so that whenever the host code launches this kernel, it generates this grid of threads. All the threads in the grid will have the built-in variables blockIdx.x, blockDim.x, and threadIdx.x, and these predefined, preinitialized variables allow the threads to generate different data indices, so that each thread processes a different part of the data.

Now I would like to summarize how you can treat all the function declarations in your CUDA program. We have already talked about the kernel function and the host function, but there are really three types of functions in a CUDA program. The first type is the one we have not talked about so far, the device function. It is declared with __device__, and this keyword tells the compiler that the function can be executed only on the device and can be called only by another function that is executing on the device. This kind of function is typically called by a kernel function or by other device functions.
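As a hypothetical illustration of this first type (not code from the lecture), here is a small __device__ helper called from a kernel; the names addOne and incrementKernel are invented for the example.

```cuda
// Hypothetical illustration of a __device__ function: it runs on the
// device and is called from code already running on the device.
// The names addOne and incrementKernel are invented for this example.
__device__ float addOne(float x) {
    return x + 1.0f;
}

__global__ void incrementKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = addOne(data[i]);  // device code calling a device function
    }
}
```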
The second type is the one we have talked about the most so far, the kernel function. The kernel function is a very special device function: it can be executed on the device, but it can only be called from the host. This is really the gateway to parallel execution; when the host code wants to trigger any kind of parallel execution, it launches a kernel function. The third type is the host function. It can optionally be preceded by __host__. These functions are just traditional C functions that are executed on the host and can only be called from the host.

A few important details. __global__ declares a kernel function; make sure you remember that there are two underscores before and two underscores after the word. A kernel function must return void, because these functions are special: they are really used to trigger parallel execution on the device, and there is no easy, simple way to return a value. The question would be which thread should even give you that value back. So, to keep things simple, kernel functions must always return void.

The second detail is that __device__ and __host__ can be used together, and this gives you a function that is compiled both ways: it is compiled into a host function, and it is also compiled into a device function. This way the function can be used appropriately on both the host and the device, and it is often used for utility functions that are useful on both.

Finally, if a function is truly only going to be a host function, then __host__ is optional. This allows you to port a C program into CUDA very quickly, because all the original functions can remain as they are, with no keywords, and they will be assumed to run only on the host. You can then gradually add keywords to the function declarations to turn some of them into kernel functions or device functions, or to make some functions both host and device functions.
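As a hypothetical example of the combined qualifiers mentioned above (not from the lecture), a small utility compiled for both sides might look like this; the name clampf is invented.

```cuda
// Hypothetical example of __host__ __device__: this utility is compiled
// twice, once as a host function and once as a device function, so it
// can be called from either side. The name clampf is invented here.
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}
```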
Now that we know how to declare a kernel, how to launch a kernel, and so on, I would like to tell you a little bit more about how a CUDA program is compiled.

In any file we can have both host functions and device functions, so NVCC, the NVIDIA C compiler, will automatically identify all the host functions and kernel functions and separate them into two compilation paths. The host functions go through the host code path, where they are compiled and linked by a host C compiler, such as GCC or Intel ICC. The second path is for the device functions and kernel functions. These functions are compiled into a binary file format called PTX, which is very similar to Java bytecode. At run time, a just-in-time compiler takes the PTX binary and generates the real ISA (instruction set architecture) binary for the particular device you are using. Both types of executables are then used on the heterogeneous computing platform: the host binary runs on the host, which is the CPU, and the device binary runs on the device hardware, such as a GPU. This allows you to take a single program with host and device functions and run it correctly on a heterogeneous computing platform.

So at this point, we have finished the basic concepts involved in writing a simple, basic CUDA program. I would like to encourage you to take the lab tour; the lab tour will show you the environment you will use to write the CUDA kernels for all the assignments, and it will also show you how to get the instructions and how to submit your assignments. Once you finish that tour, I would like to encourage you to start the first lab assignment, which is the vector addition CUDA code. This will help you solidify your understanding of all the concepts that we have discussed to date. For those of you who would like to learn more about the CUDA language at the introductory level, I recommend that you read Chapter 3 of the textbook. Thank you.